SQL Data Cleaning Project: Global Layoffs Dataset
This project demonstrates my proficiency in SQL data cleaning by transforming a raw global layoffs dataset into a clean, standardized, and analysis-ready format. The process involved handling duplicates, standardizing inconsistent entries, managing missing values, and optimizing the dataset for future insights.
Problem Statement
The initial layoffs dataset was plagued with common data quality issues such as duplicate records, varied spellings for company names and locations, inconsistent date formats, and null values in critical columns. These issues would have severely hampered any accurate analysis. My objective was to resolve these inconsistencies to ensure data integrity and reliability.
Tools Used
- SQL (MySQL)
Data Cleaning Process
1. Removing Duplicate Records
To ensure the uniqueness of each record and prevent skewed analytical results, I identified and removed duplicate rows. I used SQL's window functions, specifically `ROW_NUMBER()` with a `PARTITION BY` clause across all relevant columns, to number each copy of a row. Because MySQL does not allow deleting directly from a CTE, the numbered rows were inserted into a second staging table (`layoffs_staging2`) with a persistent `row_num` column, and every row numbered greater than 1 was then deleted.
```sql
-- Copy the staging table, adding a row_num column to flag duplicates
CREATE TABLE layoffs_staging2 LIKE layoffs_staging;
ALTER TABLE layoffs_staging2 ADD COLUMN row_num INT;

INSERT INTO layoffs_staging2
SELECT *,
    ROW_NUMBER() OVER(
        PARTITION BY company, location, industry, total_laid_off,
                     percentage_laid_off, `date`, stage, country,
                     funds_raised_millions) AS row_num
FROM layoffs_staging;

-- Remove every copy after the first
DELETE
FROM layoffs_staging2
WHERE row_num > 1;
```
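A quick sanity check (an illustrative query, not part of the original script) confirms that no duplicates survive: grouping by every partitioning column should return no group with more than one row.

```sql
-- Any group with more than one row is a surviving duplicate
SELECT company, location, `date`, COUNT(*) AS copies
FROM layoffs_staging2
GROUP BY company, location, industry, total_laid_off,
         percentage_laid_off, `date`, stage, country, funds_raised_millions
HAVING COUNT(*) > 1;
```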
2. Standardizing Data Inconsistencies
Inconsistent data entries, such as variations in company names (' Google' vs 'Google') or industry categories ('Crypto', 'Crypto Currency'), and non-standardized date formats, were addressed. I used `TRIM()` to remove leading/trailing spaces, `UPDATE` statements with `LIKE` clauses to unify categorical data, and `STR_TO_DATE()` followed by `ALTER TABLE` to convert dates to a uniform `DATE` data type.
```sql
-- Trim whitespace from company names
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Standardize 'Crypto' industry variations
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Standardize 'United States' country variations
UPDATE layoffs_staging2
SET country = 'United States'
WHERE country LIKE 'United States%';

-- Convert text dates to DATE values, then change the column type
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```
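One caveat worth noting: `STR_TO_DATE()` returns NULL for any value that does not match the format string, silently discarding malformed dates. A check like the following (illustrative, not part of the original script) can be run before the conversion to surface non-conforming values:

```sql
-- Find date strings that would not survive the conversion
SELECT `date`
FROM layoffs_staging2
WHERE `date` IS NOT NULL
  AND STR_TO_DATE(`date`, '%m/%d/%Y') IS NULL;
```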
3. Handling Missing Values
To improve the completeness of the dataset, particularly in the 'industry' column, I filled in missing values. By self-joining the table on the 'company' column, I populated a null 'industry' value whenever a non-null industry existed for the same company in another record.
```sql
-- Fill in missing industry values using a self-join
UPDATE layoffs_staging2 AS t1
JOIN layoffs_staging2 AS t2
    ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```
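If blanks in the raw data were loaded as empty strings rather than true NULLs, the `IS NULL` condition would miss them. A common preliminary step (assumed here, depending on how the raw file was imported) is to normalize empty strings to NULL first:

```sql
-- Normalize empty strings to NULL so the self-join can fill them
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';
```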
4. Removing Irrelevant Rows and Columns
Finally, I removed rows that contained no relevant layoff information (where both `total_laid_off` and `percentage_laid_off` were null) as they contribute no value to analysis. Additionally, the temporary `row_num` column, used during the duplicate removal process, was dropped to finalize the cleaned dataset.
```sql
-- Remove rows with no layoff data
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

-- Drop the temporary row_num column
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```
Results & Impact
These cleaning steps produced a reliable, analysis-ready dataset, well suited for exploratory data analysis of global layoff trends, industry-specific impacts, and geographical distributions. It forms a solid foundation for subsequent reporting, visualization, or machine learning tasks.
Learnings and Takeaways
This project significantly enhanced my understanding of common data quality issues and reinforced my ability to apply advanced SQL techniques for data manipulation. Key learnings include:
- Mastering window functions for data deduplication.
- Effective use of string functions (`TRIM`, `LIKE`) for data standardization.
- Strategic application of self-joins for intelligent missing value imputation.
- The critical importance of a systematic approach to data cleaning as the foundation for reliable data analysis.
Next Steps
With the data now clean, the next logical steps would involve performing in-depth Exploratory Data Analysis (EDA) to uncover patterns and trends, and building interactive dashboards using tools like Tableau or Power BI to visualize key insights for stakeholders.
Project information
- Category: Data Cleaning
- Tools: SQL (MySQL)
- Project date: July 2025
- Project link: GitHub Repository